
Resolve AppKit and Agent Skills versions from compatibility manifest#5139

Open
pkosiec wants to merge 7 commits into main from pkosiec/appkit-version-pinning

@pkosiec
Member

@pkosiec pkosiec commented Apr 30, 2026

Summary

Introduces a CLI compatibility manifest (internal/build/cli-compat.json) that maps CLI versions to compatible AppKit template and Agent Skills versions. This lets template and skills updates reach users without a new CLI release.
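A minimal Go sketch of that mapping, assuming each manifest key is the lowest CLI version an entry applies to (plain `MAJOR.MINOR.PATCH`) and that dev builds, whose version starts with `0.0.0-dev`, use the `next` key. `Entry`, `Manifest`, and `resolve` are illustrative names, not the PR's code. Unlike the behavior questioned later in review (returning the lowest entry when the CLI predates every key), this sketch returns an error so a caller could fall back to the embedded manifest:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Entry holds the pinned dependency versions for one CLI version range.
// Field names are illustrative, not the PR's types.
type Entry struct {
	AppKit      string `json:"appkit"`
	AgentSkills string `json:"skills"`
}

// Manifest maps a minimum CLI version (or "next") to its Entry.
type Manifest map[string]Entry

const devVersionPrefix = "0.0.0-dev"

// parseVersion parses "MAJOR.MINOR.PATCH" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether version a sorts before version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// resolve returns the entry whose key is the highest version <= cliVersion.
// Dev builds use the "next" entry. A CLI older than every key is an error.
func resolve(m Manifest, cliVersion string) (Entry, error) {
	if strings.HasPrefix(cliVersion, devVersionPrefix) {
		e, ok := m["next"]
		if !ok {
			return Entry{}, errors.New(`manifest has no "next" entry for dev builds`)
		}
		return e, nil
	}
	cli, err := parseVersion(cliVersion)
	if err != nil {
		return Entry{}, err
	}
	var best Entry
	var bestKey [3]int
	found := false
	for key, e := range m {
		if key == "next" {
			continue
		}
		kv, err := parseVersion(key)
		if err != nil {
			return Entry{}, err
		}
		// Keep the largest key that does not exceed the CLI version.
		if !less(cli, kv) && (!found || less(bestKey, kv)) {
			best, bestKey, found = e, kv, true
		}
	}
	if !found {
		return Entry{}, fmt.Errorf("CLI version %s is older than every manifest entry", cliVersion)
	}
	return best, nil
}
```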

The manifest is resolved with a 3-tier fallback:

  1. Fresh local cache (< 1h)
  2. Remote fetch from GitHub (with retry)
  3. Stale local cache or embedded manifest fallback

Companion PRs:

Screenshot: (image not included)

@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 15:10 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 15:20 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 16:12 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 16:16 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 16:25 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is April 30, 2026 16:35 — with GitHub Actions Inactive
@pkosiec pkosiec marked this pull request as ready for review April 30, 2026 16:39
@github-actions

github-actions Bot commented Apr 30, 2026

Approval status: pending

/cmd/apps/ - needs approval

Files: cmd/apps/init.go, cmd/apps/init_test.go, cmd/apps/manifest.go
Suggested: @MarioCadenas
Also eligible: @arsenyinfo, @keugenek, @calvarjorge, @fjakobs, @jamesbroadhead, @Shridhad, @atilafassina, @igrekun, @pffigueiredo, @ditadi

/experimental/aitools/ - needs approval

9 files changed
Suggested: @MarioCadenas
Also eligible: @arsenyinfo, @keugenek, @calvarjorge, @fjakobs, @jamesbroadhead, @Shridhad, @atilafassina, @igrekun, @pffigueiredo, @ditadi, @lennartkats-db

/internal/ - needs approval

Files: internal/build/README.md, internal/build/cli-compat.json, internal/build/clicompat.go
Suggested: @simonfaltum
Also eligible: @renaudhartert-db, @hectorcast-db, @parthban-db, @tanmay-db, @Divyansh-db, @tejaskochar-db, @mihaimitrea-db, @chrisst, @rauchy

General files (require maintainer)

4 files changed
Based on git history:

  • @pietern -- recent work in internal/build/, experimental/aitools/lib/installer/, cmd/apps/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

Contributor

@renaudhartert-db renaudhartert-db left a comment


Thanks for the PR @pkosiec. Could we have a chat internally about what you're trying to achieve? I'd like to make sure that this is aligned with the overall direction we're planning to evolve that command toward.

@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 12:35 — with GitHub Actions Inactive
@pkosiec pkosiec changed the title "feat: resolve AppKit template version from compatibility manifest" to "feat: CLI compatibility manifest with 3-tier fallback" May 5, 2026
@pkosiec pkosiec marked this pull request as draft May 5, 2026 12:43
@pkosiec pkosiec force-pushed the pkosiec/appkit-version-pinning branch from d7d005f to 3241ae2 May 5, 2026 13:00
@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 13:00 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 13:17 — with GitHub Actions Inactive
@pkosiec pkosiec changed the title "feat: CLI compatibility manifest with 3-tier fallback" to "Resolve AppKit and Agent Skills versions from compatibility manifest" May 5, 2026
@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 13:24 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 13:30 — with GitHub Actions Inactive
@pkosiec pkosiec temporarily deployed to test-trigger-is May 5, 2026 13:59 — with GitHub Actions Inactive
@pkosiec pkosiec marked this pull request as ready for review May 5, 2026 13:59
@pkosiec pkosiec temporarily deployed to test-trigger-is May 6, 2026 08:03 — with GitHub Actions Inactive
@pkosiec
Member Author

pkosiec commented May 6, 2026

@renaudhartert-db @simonfaltum Please take a look 🙏 I applied all the suggestions after discussion with Simon (thanks for the feedback!). Now the manifest resides in the CLI repo.

pkosiec added 7 commits May 6, 2026 15:38
- Embed cli-compat.json in the CLI binary with build-time fetch
- Resolve AppKit and Agent Skills versions from the manifest
- Add 1h local cache and retry for runtime manifest fetches
- Show default AppKit version in --version flag help
- Print resolved skills version during aitools install

Co-authored-by: Isaac
- Guard printVersionLine against empty latestRef to prevent confusing
  "Update available: v" output when skills version resolution fails
- Sort versionedKeys in TestEmbeddedManifest_IsWellFormed to prevent
  flaky test from map iteration randomness
- Preserve embedded manifest parse error in FetchManifest error message
- Add test for GetSkillsRef fallback to embedded manifest

Co-authored-by: Isaac
- Rename libs/depversions/ to libs/clicompat/ (package + files)
- Rename internal/build/dep_versions.go to clicompat.go
- Rename EmbeddedManifestJSON to CLICompatManifestJSON
- Extract devVersionPrefix const for "0.0.0-dev"
- Wrap both errors with %w in FetchManifest fallback
- Add template-v tag validation to bump-cli-compat skill
- Remove confusing positional args from skill (named flags only)
- Fix README: "After each AppKit or Agent Skills release"

Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
- Fix stale libs/depversions/ references in README (renamed to libs/clicompat/)
- Fix incorrect "next" key description (only used for dev builds, not newer-than-all)
- Fix import ordering (clicompat before cmdctx/cmdio alphabetically)
- Fix slices import grouping in clicompat_test.go (stdlib, not separate group)
- Fix error format: semicolon instead of period before hint text
- Fix FetchManifest godoc: "4-tier fallback" to match numbered list
- Fix writeLocalManifest comment: explain temp-file-then-rename pattern
- Fix README example values to match actual manifest
- Fix flaky test: pre-populate cache to avoid real network calls

Co-authored-by: Isaac
When the manifest resolves to a version that doesn't exist as a git tag
(404), retry with the version from the embedded manifest. Only triggers
on "not found" errors, not transient network failures.

Also:
- Rename EmbeddedDefaultAppKitVersion/EmbeddedResolve* to
  ResolveEmbeddedAppKitVersion/ResolveEmbeddedAgentSkillsVersion
- Remove duplicate log lines (keep only log.Warnf, drop cmdio.LogString)
- Drop ctx parameter from ResolveEmbedded* (not needed)

Signed-off-by: Pawel Kosiec <pawel.kosiec@databricks.com>
The linter flagged `tag = fallbackVersion` as dead code because `tag`
is never read after the assignment. The variable is only needed earlier
in the function for logging and comparison.

Co-authored-by: Isaac
@pkosiec pkosiec force-pushed the pkosiec/appkit-version-pinning branch from 4806f28 to 877ba5c May 6, 2026 13:42
@pkosiec pkosiec temporarily deployed to test-trigger-is May 6, 2026 13:42 — with GitHub Actions Inactive
Member

@simonfaltum simonfaltum left a comment


Review summary

I ran a multi-agent review (Isaac + Cursor) on this PR. Both reviewers converged on REQUEST_CHANGES: the 4-tier fallback design is sound and tests are substantial, but the live remote manifest is treated as more trusted than its current validation justifies, and the embedded fallback is applied inconsistently across consumers.

Must-fix before merge:

  1. parseManifest doesn't validate appkit/skills values — empty/non-semver values fall through and git clone ends up using AppKit's default branch instead of a pinned tag.
  2. Embedded not-found fallback is missing from UpdateSkills, runManifestOnly, and defaultListSkills — inconsistent recovery across commands.
  3. IsNotFoundError uses fragile substring matching — replace with a sentinel error wrapped at the source.
  4. Resolve returns the lowest entry when CLI < min(keys) — needs a documented never-prune policy or fail-and-fallback to embedded.
  5. writeLocalManifest removes the cache before rename — failed rename leaves the user with no cache, defeating tier 3a.

Other findings posted inline below: 4xx retry suppression, missing User-Agent, debug-only cache write errors, naming/test cleanups, and questions about rollback story and offline first-run latency.

Some of these may overlap with the changes already requested by @renaudhartert-db; happy to defer where they do.

		}
	}
	return m, nil
}
Member


Blocker: parseManifest doesn't validate the appkit/skills values inside each entry.

A live manifest like {"next":{"skills":"0.1.5"},"0.300.0":{"skills":"0.1.5"}} parses successfully and yields an empty AppKit version. Empty strings normalize to an empty git ref, and git clone then uses the AppKit repo's default branch instead of a pinned template-v... tag. Non-semver values cause similar silent drift.

Suggested fix: after the existing checks in parseManifest, iterate entries and reject any whose AppKit or AgentSkills is empty or fails a semver check. Add tests for missing/invalid appkit and skills values. Keep unknown extra fields permitted for forward-compat.
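A sketch of the suggested check, exercised with the reviewer's example of entries that have no `appkit` value. The `Entry` type and `validateEntries` are hypothetical, and the regular expression is a simplified semver check rather than the full grammar; `golang.org/x/mod/semver.IsValid` is an alternative, though it requires a leading `v`:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Entry mirrors one manifest entry; field names are illustrative.
type Entry struct {
	AppKit      string
	AgentSkills string
}

// semverRe accepts MAJOR.MINOR.PATCH with an optional leading "v" and an
// optional pre-release/build suffix. Simplified; not the full semver.org grammar.
var semverRe = regexp.MustCompile(`^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)([-+][0-9A-Za-z.+-]+)?$`)

// validateEntries rejects entries whose AppKit or AgentSkills value is empty or
// not semver-shaped. Keys are visited in sorted order so errors are stable.
func validateEntries(m map[string]Entry) error {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		e := m[k]
		if !semverRe.MatchString(e.AppKit) {
			return fmt.Errorf("manifest entry %q: invalid appkit version %q", k, e.AppKit)
		}
		if !semverRe.MatchString(e.AgentSkills) {
			return fmt.Errorf("manifest entry %q: invalid skills version %q", k, e.AgentSkills)
		}
	}
	return nil
}
```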

Comment on lines +84 to +86
	latestTag, err := GetSkillsRef(ctx)
	if err != nil {
		return nil, err
Member


Blocker: UpdateSkills doesn't fall back to the embedded manifest on not-found.

(Commenting here on the changed lines; the issue is the src.FetchManifest(ctx, latestTag) call that follows shortly.)

InstallSkillsForAgents (in installer.go) catches IsNotFoundError from FetchManifest and falls back to the embedded skills version. UpdateSkills does not. If a bad cli-compat.json is published (or sits in the user's local cache for up to an hour) pointing to a tag that does not exist on databricks/databricks-agent-skills:

  • databricks experimental aitools install recovers via embedded
  • databricks experimental aitools update fails outright until the manifest is fixed

A user who installed from the embedded fallback then cannot update until the manifest is corrected. This is precisely the failure mode the embedded fallback is designed to prevent.

Cursor caught the same pattern in aitools list and apps init --manifest. Suggested fix: extract the not-found-then-embedded-fallback into a small helper (e.g. fetchSkillsManifestWithFallback(ctx, src, ref)) and use it from all four consumers.

Comment thread cmd/apps/manifest.go
Comment on lines +31 to 38
		appkitVersion, err := clicompat.ResolveAppKitVersion(ctx)
		if err != nil {
			return fmt.Errorf("could not resolve AppKit template version: %w; use --version to specify a version manually", err)
		}
		gitRef = normalizeVersion(appkitVersion)
	}
	templateSrc = appkitRepoURL
}
Member


Blocker: runManifestOnly has no fallback when the resolved tag doesn't exist.

cmd/apps/init.go falls back to the embedded AppKit version when the resolved template tag returns not-found (via the awaitTemplate/IsNotFoundError dance). runManifestOnly calls clicompat.ResolveAppKitVersion and then proceeds to clone with no equivalent fallback. If the manifest points to a tag that doesn't exist (the same scenario the embedded fallback was added to handle), apps init --output-manifest errors out instead of recovering.

Suggested fix: pair this with the centralization noted in update.go. Push the fallback inside clicompat/the template-resolution layer so all default-ref consumers benefit automatically (preferred), or duplicate the embedded-fallback dance from init.go here.

Comment on lines +51 to +53
	ref, err := installer.GetSkillsRef(ctx)
	if err != nil {
		return err
Member


Blocker: defaultListSkills has no embedded fallback.

(Commenting on the changed lines; the issue is the src.FetchManifest call right after.)

Same pattern as update.go and apps/manifest.go: this fetches with the resolved ref and returns the error directly. A bad/typo'd manifest entry breaks aitools list even though a known-good embedded version exists.

Should be fixed with a centralized helper used by all default-ref consumers.

Comment on lines +143 to +149
func IsNotFoundError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "not found") || strings.Contains(msg, "404")
}
Member


Important: IsNotFoundError is fragile substring matching across package boundaries.

The errors this inspects come from different packages (installer.GitHubManifestSource.FetchManifest, cmd/apps/init.go's awaitTemplate). Any string change in those callers silently breaks the fallback. It also matches false positives (e.g. "hostname not found", "exit code 404") and misses true positives like HTTP 410 Gone or git "reference does not exist" wording.

Suggested fix: define a sentinel var ErrNotFound = errors.New("compat target not found") in clicompat, and wrap returned errors at the source — installer/source.go already classifies HTTP 404 and could return fmt.Errorf("...: %w", clicompat.ErrNotFound). Then callers use errors.Is(err, clicompat.ErrNotFound). This decouples the fallback from error message wording.


// httpClient is the HTTP client used for manifest fetches. Package-level var
// so tests can replace it.
var httpClient = &http.Client{Timeout: fetchTimeout}
Member


Nit: httpClient is a mutable package var swapped by tests (clicompat_test.go). Fine today since tests don't t.Parallel(), but the moment anyone tries to parallelize this package it'll race.

Lower-effort fix: leave it but add a comment that this package is intentionally not parallel-test-safe. Better: inject the *http.Client into a small struct or pass via context.

// --- Local manifest cache ---

// manifestLocalPath returns the path to the locally cached manifest file.
func manifestLocalPath(ctx context.Context) string {
Member


Nit: if os.UserCacheDir() fails (rare, e.g., no HOME), manifestLocalPath returns "" without logging anything, which turns every cache read/write into a silent no-op.

Suggested fix: log.Debugf(ctx, "Cannot determine cache dir: %v", err) before returning "".

Comment on lines +113 to +128
func ResolveEmbeddedAppKitVersion() (string, error) {
	m, err := parseManifest(build.CLICompatManifestJSON)
	if err != nil {
		return "", fmt.Errorf("embedded manifest: %w", err)
	}
	entry, err := Resolve(m, build.GetInfo().Version)
	if err != nil {
		return "", fmt.Errorf("embedded manifest resolve: %w", err)
	}
	return entry.AppKit, nil
}

// ResolveEmbeddedAgentSkillsVersion resolves the Agent Skills version from only
// the embedded manifest for the current CLI version. Used as a fallback when the
// primary version points to a non-existent tag.
func ResolveEmbeddedAgentSkillsVersion() (string, error) {
Member


Nit: ResolveEmbeddedAppKitVersion and ResolveEmbeddedAgentSkillsVersion re-parse the embedded JSON on every call. Embedded JSON is small so this is negligible, but it's trivially cacheable:

var embeddedManifest = sync.OnceValues(func() (Manifest, error) {
    return parseManifest(build.CLICompatManifestJSON)
})

Use that here and in tier 3b of FetchManifest.

Comment on lines +70 to +99
// 1. Local cached file (if fresh, < 1 hour old)
// 2. Remote fetch from GitHub (with retry)
// 3. Stale local file (if remote fails but a previously cached file exists)
// 4. Embedded manifest compiled into the binary
func FetchManifest(ctx context.Context) (Manifest, error) {
	localPath := manifestLocalPath(ctx)

	// Read local file once — reuse across tiers.
	local, localErr := readLocalManifest(localPath)

	// Tier 1: local file is fresh.
	if localErr == nil && local.isFresh(cacheTTL) {
		log.Debugf(ctx, "Using cached manifest from %s", localPath)
		return local.manifest, nil
	}

	// Tier 2: fetch from remote (local file missing or stale).
	m, fetchErr := fetchRemoteWithRetry(ctx)
	if fetchErr == nil {
		writeLocalManifest(ctx, localPath, m)
		return m, nil
	}

	// Tier 3a: local file exists but stale — use it anyway.
	if localErr == nil {
		log.Debugf(ctx, "Using stale cached manifest (remote failed: %v)", fetchErr)
		return local.manifest, nil
	}

	// Tier 3b: embedded manifest.
Member


Question: rollback story for a bad cli-compat.json?

The cache TTL is 1 hour, so a bad manifest keeps affecting every client that cached it for up to an hour after a fix is merged (tier 1 keeps serving the cached bad manifest until it goes stale). Is this acceptable, or should there be a way for users to force a refresh? An env var (DATABRICKS_CLI_COMPAT_REFRESH=1) or a --refresh-compat flag would give an escape hatch.

Comment on lines +218 to +230
func ResolveAppKitVersion(ctx context.Context) (string, error) {
	entry, err := resolveEntry(ctx, build.GetInfo().Version)
	if err != nil {
		return "", err
	}
	return entry.AppKit, nil
}

// ResolveAgentSkillsVersion resolves the Agent Skills version for the current CLI.
func ResolveAgentSkillsVersion(ctx context.Context) (string, error) {
	entry, err := resolveEntry(ctx, build.GetInfo().Version)
	if err != nil {
		return "", err
Member


Question: apps init flag help text uses embedded version, but runCreate calls the network-fetching ResolveAppKitVersion.

These can disagree (help text says default is X, but the actual default is Y). Probably fine, but worth confirming this is intentional and maybe a one-liner in the README to set the expectation.
